Deriving a Web-Scale Common Sense Fact Database
Abstract
The fact that birds have feathers and ice is cold seems trivially true. Yet, most machine-readable sources of knowledge either lack such common sense facts entirely or have only limited coverage. Prior work on automated knowledge base construction has largely focused on relations between named entities and on taxonomic knowledge, while disregarding common sense properties. In this paper, we show how to gather large amounts of common sense facts from Web n-gram data, using seeds from the ConceptNet collection. Our novel contributions include scalable methods for tapping into Web-scale data and a new scoring model to determine which patterns and facts are most reliable. The experimental results show that this approach extends ConceptNet by many orders of magnitude at comparable levels of precision.

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Motivation. Roses are red, violets are blue. Facts of this sort seem trivially true. Yet, knowledge that humans take for granted on a daily basis is not readily available in computational systems. For several decades, the knowledge acquisition bottleneck has been a major impediment to the development of intelligent systems. If such knowledge were more easily accessible, applications could behave more in line with users’ expectations. For example, a mobile device could recommend nearby coffee shops rather than ice cream vendors when users desire warm beverages. A search engine would be able to suggest local supermarkets when a user wishes to buy soap. Further applications include query expansion (Hsu, Tsai, and Chen 2006), video annotation (Altadmri and Ahmed 2009), faceted search (Bast et al. 2007), and distance learning (Anacleto et al. 2006), among other things. Liu and Singh (2004) provide a survey of applications that have made use of explicit common sense facts.

Previous work to formalize our common sense understanding of the world has largely been centered around a) manual efforts, e.g. Cyc, SUMO, and WordNet, as well as resources like ConceptNet (Havasi, Speer, and Alonso 2007) that rely on crowd-sourcing, and b) minimally supervised information extraction from text (Etzioni et al. 2005; Suchanek, Sozio, and Weikum 2009; Carlson et al. 2010).

Both strategies are limited in scope or coverage. Human efforts are time-consuming and often fail to attract sufficient numbers of contributors. Information extraction methods have been successful for taxonomic knowledge (IsA and InstanceOf relationships among classes and between entities and classes), and for relations between named entities (e.g. birthplaces of people). Most extraction systems rely on pattern matching: for example, a string like “... cities such as Paris ...” matches the “such as” pattern for the IsA relation and yields knowledge of the form IsA(Paris,City). Previous work (Hearst 1992) has shown that textual patterns can be surprisingly reliable, but that matches are generally very rare. For instance, in a 20 million word New York Times article collection, Hearst found only 46 facts.

Contribution. This paper explores how large numbers of common sense properties like CapableOf(dog,bark) and PartOf(room,house) can be harvested automatically from the Web. A new strategy is proposed to overcome the robustness and scalability challenges of previous work.
• Rather than starting out with minimal numbers of seeds, we exploit information from an existing fact database, ConceptNet (Havasi, Speer, and Alonso 2007).
• Rather than using a text corpus, we rely on a Web-scale n-gram dataset, which gives us a synopsis of a significant fraction of all text found on the Web. While people rarely explicitly express the obvious, we believe that “a word is characterized by the company it keeps” (Firth 1957) and exploit the very large quantities of natural language text that are now available on the Web.
• Unlike standard bootstrapping approaches, we rely on novel scoring functions to very carefully determine which patterns are likely to lead to good extractions.
• Unlike previous unsupervised approaches, we rely on a semi-supervised approach for scoring the output facts. The model is obtained from the input data, without any need for additional manual labelling.
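The pattern-matching idea mentioned in the introduction, where “cities such as Paris” yields IsA(Paris,City), can be illustrated with a minimal sketch. This is not the paper's extraction system: the regular expression, the `singularize` helper, and the `extract_isa` function are hypothetical names introduced here only for illustration, and the singularization rule is deliberately naive.

```python
import re

# Hypothetical, deliberately naive pattern: "<plural class> such as <Instance>".
# It assumes the class is a lowercase word and the instance is capitalized.
SUCH_AS = re.compile(r"\b([a-z]+)\s+such\s+as\s+([A-Z][a-z]+)")


def singularize(noun: str) -> str:
    """Very rough singularization, sufficient for this example only."""
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


def extract_isa(text: str) -> list[tuple[str, str, str]]:
    """Return (relation, instance, class) triples for each pattern match."""
    facts = []
    for match in SUCH_AS.finditer(text):
        cls, instance = match.group(1), match.group(2)
        facts.append(("IsA", instance, singularize(cls).capitalize()))
    return facts


if __name__ == "__main__":
    print(extract_isa("We visited cities such as Paris last year."))
```

A single hand-written pattern like this matches very few sentences in any one corpus, which is consistent with the rarity Hearst observed and with the motivation for moving to Web-scale n-gram data.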
Publication date: 2011